-
Notifications
You must be signed in to change notification settings - Fork 539
feat: support Utf8View and BinaryView #5685
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
It seems that the compilation error(https://github.com/lance-format/lance/actions/runs/20890689101/job/60043036449?pr=5685) has nothing to do with the modifications made in this pr. @wjones127 |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
4123e18 to
80f2d92
Compare
wjones127
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be able to accomplish this without having to convert in so many places.
Also, could you validate FTS indices build correctly? I think one function you'll need to update is here:
73f8249 to
29c016b
Compare
|
@wjones127 please review this pr again |
wjones127
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes looking good so far.
Note that before we merge this change, we'll need to change the spec to add these new data types first. So the next steps (which I've started are):
- Post a discussion about the changes #5817
- Vote on the spec changes (1 week voting period)
- If the vote passes, merge the spec changes
- Merge this PR
rust/lance-encoding/src/data.rs
Outdated
| // Check if we need to convert offsets from i32 to i64 (for LargeUtf8/LargeBinary) | ||
| let offsets_buffer = if self.bits_per_offset == 32 | ||
| && (data_type == DataType::LargeUtf8 || data_type == DataType::LargeBinary) | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The offsets conversion logic at lines 587-590 is necessary for backward compatibility. It handles the case when reading legacy data that was stored with 32-bit offsets but is now being read as LargeUtf8/LargeBinary (which use 64-bit offsets).
This is not introduced by this PR - it's existing code to handle historical data formats. I've added more detailed comments to clarify this purpose.
For Utf8View and BinaryView types in this PR, this conversion is not triggered because we store them with 64-bit offsets (as implemented in arrow_view_to_data_block).
bbe0248 to
6fc8dfc
Compare
close #5172 , add support of
Utf8ViewandBinaryView, test these two types intest_query_string